MACHINE LEARNING TUTORIALS

ML Tutorials: Intro to MLOps and ML Engineering


According to a report conducted by Gartner, 85% of data science projects fail. In this learning series, we explore why these initiatives fail and how organizations have found ways to overcome setbacks. This tutorial will cover the following learning objectives:

  • Why Do Data Science Initiatives Fail?
  • What is MLOps?
  • What is ML Engineering?
  • How to Become an ML Engineer?

Why Do Data Science Initiatives Fail?




Summary

  • Since the early 2010s, organizations big and small have realized the value of data and what it can do for product development and placement. They go on to hire teams of Data Scientists, Data Analysts, and Data Engineers without having a solid infrastructure in place. This goes on to create massive technical debt in the form of high cloud computing costs and high salaries to be paid.
  • According to the Gartner report mentioned above and in the video, Data Science teams mentioned common issues in the following areas:
    • DATA: Data Science teams mentioned issues with poor data quality, trouble obtaining data from multiple sources, and privacy issues surrounding Personally Identifiable Information (PII) (e.g., Social Security Numbers, Birth Dates, Home Addresses). As a result, Data Scientists could not trust the data enough to train their models on it.
    • HUMAN FACTORS: According to the Gartner Report, over 40% of respondents stated that they felt ML Engineers were not qualified to be working on high-end models in production. This led to mistrust between team members and a lack of communication between stakeholders and Data Science Teams. Common issues mentioned in the report include the following:
      • Lack of managerial support
      • Lack of clear questions to address
      • Results (extracted from the models) are not used or implemented by the business
      • Difficulty Deploying Models and Coordinating Permissions with IT
      • Difficulty Explaining Data Science Principles to Non-Technical Stakeholders
    • TOOLS: Most Data Science tools used by organizations in the early- to mid-2010s were unable to keep up with the increasing amount of data generated by businesses. During this time, businesses were using on-premises and open-source tools that were cheap and effective, but inefficient at working with large amounts of data and performing complex calculations, such as those required by Neural Networks.
    • FINANCING: In the Gartner Report, almost 37% of respondents stated that their projects could not be completed due to a lack of financial support. Even today, with cloud service providers, organizations often need to set budgets on certain initiatives to prevent financial losses. However, if these budgets are too tight, the business may never see the planned ROI.

What is MLOps?




Summary

  • Machine Learning Operations (MLOps) is a core function of ML Engineering, focused on streamlining the process of taking ML models to production, and then maintaining and monitoring them.
  • MLOps can best be seen as DevOps principles applied to Machine Learning initiatives.
  • MLOps helps organizations achieve the following goals:
    1. Faster experimentation and model development
    2. Faster deployment of updated models into production
    3. Quality assurance
  • The main reason MLOps exists in the first place is to help organizations overcome the most difficult hurdle in any Data Science initiative: DATA. MLOps assists developers in creating a framework to improve data quality, version source data using cleaning techniques, and keep track of metadata that helps Data Scientists build trust in the data.
  • At the core of MLOps is the Machine Learning Lifecycle. We will go into further detail on this in the next tutorial. For the time being, just know that Experimentation refers to the process of selecting/generating features, selecting algorithms, tuning hyperparameters, and fitting the model.
  • Why is MLOps used? It covers the following organizational needs:
    • Learn from your mistakes (re-training/re-building models in an iterative fashion)
    • Track Training and Evaluation Metrics
    • Source Control the code (to ensure dependencies are intact)
    • Checkpoint Steps in the ML Lifecycle through the use of ML Pipelines
    • Automating proper validation/staged deployment
  • Before MLOps became an industry standard, Data Science teams would regularly deploy models into production on the assumption that they met the business's needs. This would lead to reduced ROI and increased technical debt.
  • It's worth mentioning that MLOps also introduced the idea of Responsible ML, which addressed the Data Privacy concerns raised by the respondents in the Gartner Report. Responsible ML provided a framework for ML Engineers to understand the interpretability and fairness of their models and to reduce any outstanding biases. This also protected users and internal stakeholders from unethical use cases by external hackers.
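
The "Track Training and Evaluation Metrics" need listed above can be sketched in plain Python. This is only a minimal illustration using the standard library, not the API of any particular MLOps tool: the `RunRecord` and `ExperimentLog` names, the `max_depth` parameter, and the accuracy values are all hypothetical and made up for the example.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class RunRecord:
    """One training run: the settings used and the metrics it produced."""
    run_id: str
    params: dict
    metrics: dict = field(default_factory=dict)


class ExperimentLog:
    """Hypothetical in-memory tracker (an assumption for this sketch)."""

    def __init__(self):
        self.runs = []

    def log_run(self, run_id, params, metrics):
        # Copy the dicts so later changes by the caller don't alter the log.
        record = RunRecord(run_id, dict(params), dict(metrics))
        self.runs.append(record)
        return record

    def best_run(self, metric, higher_is_better=True):
        # Only consider runs that actually reported the metric.
        candidates = [r for r in self.runs if metric in r.metrics]
        if not candidates:
            raise ValueError(f"no run reported metric {metric!r}")
        chooser = max if higher_is_better else min
        return chooser(candidates, key=lambda r: r.metrics[metric])

    def to_json(self):
        # A serialized log can be written to a file and kept with the code.
        return json.dumps([asdict(r) for r in self.runs], indent=2)


log = ExperimentLog()
# Made-up values for illustration only.
log.log_run("run-1", {"max_depth": 3}, {"val_accuracy": 0.81})
log.log_run("run-2", {"max_depth": 5}, {"val_accuracy": 0.86})
log.log_run("run-3", {"max_depth": 9}, {"val_accuracy": 0.84})

best = log.best_run("val_accuracy")
print(best.run_id, best.params)
```

Recording every run alongside its parameters is what makes the iterative re-training described above comparable from one attempt to the next.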

What is ML Engineering?




Summary

  • A Machine Learning Engineer (ML Engineer) works with Data Scientists to solve business problems using Machine Learning systems. ML Engineers utilize MLOps principles and the ML Lifecycle to design, develop, and deploy Machine Learning applications.
  • ML Engineers have the following responsibilities:
    • Prepare Data. Before any work goes into building a model or implementing an algorithm, data needs to be presented in an interpretable format. This may include changing data types, removing erroneous values, or creating calculated metrics based on what business stakeholders expect.
    • Feature Engineering. Once relevant data has been collected and prepared, the independent variables, known as features, must be identified. However, just because a certain attribute can be used to predict an outcome doesn't mean it should be. We will discuss feature engineering best practices in a later tutorial.
    • Algorithm Selection. ML Engineers need to have a moderately deep understanding of not only what algorithms are available, but which one to use based on the type of data, the expected predictive accuracy of the model, and how resource intensive the model is. For instance, if you are creating a classification model under resource constraints, a Random Forest may be a better option than a Neural Network.
    • Model Fitting. ML Engineers must have a deep understanding of overfitting and underfitting. With most supervised models, these can be avoided using basic statistical methods and tests. However, when building a model from scratch, it's a best practice to use various subsets of the data to compare results and validate the evaluation metrics before deploying the model into production.
    • Model Deployment. Once the model has been effectively trained, evaluated, and validated by business stakeholders, it can be put into a production environment. This is typically done using containers and clusters through software systems such as Docker and Kubernetes. Once the application has been containerized, it can be placed into a web or mobile application for users to experience.
    • Model Performance Monitoring and Evaluation. Just because the model and application work in a simulated environment doesn't mean they can keep up in the real world. ML Engineers need to continuously monitor model performance, checking for instances of model drift and server downtime.
  • So you want to become an ML Engineer? What Background or Skillsets do you need to have? The following is a good baseline to start with:
    • Statistics
    • SQL
    • Data Visualization (Histograms, Scatter Plots, etc.)
    • ML Algorithms and Architectures (e.g., Decision Trees, Support Vector Machines, Naive Bayes, Neural Networks)
    • Python (including the following libraries: Pandas/Polars, SciKit-Learn, TensorFlow, Keras)
    • Apache Spark (for Distributed Computing)
  • What's the Difference Between a Data Scientist and an ML Engineer?
    • A Data Scientist is traditionally tasked with communicating with business stakeholders about business needs, conducting A/B tests, and applying statistical methods to business metrics. Although they may use ML models and algorithms to complete this work, they use them for their own internal use cases rather than deploying them into a production environment.
    • An ML Engineer is traditionally tasked with developing ML applications or microservices built within an existing application. They work alongside Data Scientists most of the time to develop working products to solve business needs and provide business value. Rather than only using ML algorithms and models for internal use cases, ML Engineers use these for external use cases.
  • NOTE: The job title of ML Engineer is essentially the new "entry-level Data Scientist". Because Data Scientists are expected to have a deep understanding of the business and to communicate with both technical and non-technical stakeholders, it's traditional for most Data Scientist job listings to require at least 5 years of work experience.
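
The Model Fitting advice above (use various subsets to compare results before deploying) is commonly implemented as k-fold cross-validation. The following is a minimal sketch in standard-library Python, assuming a simple one-feature least-squares line as the model and synthetic data; the function names `k_fold_indices`, `fit_line`, `mse`, and `cross_validate` are hypothetical helpers written for this example, not part of any library.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given points."""
    slope, intercept = model
    return mean((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys))


def cross_validate(xs, ys, k=5):
    """Return (mean training error, mean validation error) over k folds."""
    train_errors, val_errors = [], []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train_idx = [i for i in range(len(xs)) if i not in held]
        x_train = [xs[i] for i in train_idx]
        y_train = [ys[i] for i in train_idx]
        x_val = [xs[i] for i in held_out]
        y_val = [ys[i] for i in held_out]
        model = fit_line(x_train, y_train)  # fit only on the training folds
        train_errors.append(mse(model, x_train, y_train))
        val_errors.append(mse(model, x_val, y_val))
    return mean(train_errors), mean(val_errors)


# Synthetic data (an assumption for the sketch): y = 2x + 1 plus noise.
rng = random.Random(42)
xs = [float(i) for i in range(50)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1.0) for x in xs]

train_err, val_err = cross_validate(xs, ys, k=5)
print(f"train MSE={train_err:.3f}  validation MSE={val_err:.3f}")
```

A validation error far above the training error is the usual sign of overfitting, while a high error on both suggests underfitting.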

How to Become an ML Engineer?




Summary

  • Since a large portion of organizational data is stored in a structured format, ML Engineers need to have a strong knowledge of SQL. They don't need to know every nook and cranny, but should be at an intermediate level to perform moderately advanced queries. Looking to get started with or improve your skills in SQL? Check out our SQL Tutorials.
  • As has been mentioned in various videos throughout this Tutorial Series, Python has become the de facto language of Data Science. Thus, it's critical for you to have a good understanding of this programming language. Looking to get started with or improve your skills in Python? Check out our Python Tutorials.
  • The next most valuable skill an ML Engineer can have is that of data cleaning. Similar to why SQL is so critical, Pandas, the most popular Python library for data cleaning, is critical to know because it's built to clean data in a structured format. Looking to get started with or improve your skills in Pandas? Check out our Pandas Tutorials.
  • ML Pipelines are basically templates for training and re-training models in both development and production environments. Some of the most common tools used for ML Pipelines include SciKit-Learn, MLFlow, and KubeFlow. However, MLFlow and KubeFlow are more niche products used within specific domains (particularly companies that sell AI products). Thus, it's wise to get started with SciKit-Learn. SciKit-Learn Tutorials will be available in Late Fall 2023 or Early Spring 2024.
  • NOTE: Although Pandas may be on every job listing for entry-level ML Engineer roles, just know that Polars has joined the chat. Polars is a Rust-based Python library built to be Pandas on steroids. The main limitation with Pandas is its inability to handle large amounts of data. However, since Polars is based on Rust, it's just as capable as Pandas but much faster at cleaning large amounts of data.
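
The idea of an ML Pipeline as a reusable template can be sketched without any library. The example below is a rough illustration in plain Python and is not SciKit-Learn's API: the `MeanImputer`, `RangeScaler`, and `SimplePipeline` classes are hypothetical, and the sketch covers only data-cleaning steps, whereas a real pipeline would usually end in a model.

```python
class MeanImputer:
    """Replace missing values (None) with the mean seen during fit."""

    def fit(self, values):
        present = [v for v in values if v is not None]
        self.fill_value = sum(present) / len(present)
        return self

    def transform(self, values):
        return [self.fill_value if v is None else v for v in values]


class RangeScaler:
    """Rescale values to [0, 1] using the min and max seen during fit."""

    def fit(self, values):
        self.low, self.high = min(values), max(values)
        return self

    def transform(self, values):
        span = (self.high - self.low) or 1.0  # avoid division by zero
        return [(v - self.low) / span for v in values]


class SimplePipeline:
    """Run the steps in order: fit on training data, reuse on new data."""

    def __init__(self, steps):
        self.steps = steps

    def fit_transform(self, values):
        # Each step learns from the output of the step before it.
        for step in self.steps:
            values = step.fit(values).transform(values)
        return values

    def transform(self, values):
        # No fitting here: new data goes through the already-learned steps.
        for step in self.steps:
            values = step.transform(values)
        return values


pipeline = SimplePipeline([MeanImputer(), RangeScaler()])

train = [10.0, None, 30.0, 20.0]
train_out = pipeline.fit_transform(train)    # learned from the training data
new_out = pipeline.transform([None, 25.0])   # same steps reused on new data
print(train_out)
print(new_out)
```

Re-training in this sketch simply means calling `fit_transform` again on fresh data, which is what makes the pipeline usable as a template in both development and production.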